Reaction performance prediction with an extrapolative and interpretable graph model based on chemical knowledge

Accurate prediction of reactivity and selectivity provides the desired guideline for synthetic development. Due to the high-dimensional relationship between molecular structure and synthetic function, it is challenging to achieve the predictive modelling of synthetic transformation with the required extrapolative ability and chemical interpretability. To meet the gap between the rich domain knowledge of chemistry and the advanced molecular graph model, herein we report a knowledge-based graph model that embeds the digitalized steric and electronic information. In addition, a molecular interaction module is developed to enable the learning of the synergistic influence of reaction components. In this study, we demonstrate that this knowledge-based graph model achieves excellent predictions of reaction yield and stereoselectivity, whose extrapolative ability is corroborated by additional scaffold-based data splittings and experimental verifications with new catalysts. Because of the embedding of local environment, the model allows the atomic level of interpretation of the steric and electronic influence on the overall synthetic performance, which serves as a useful guide for the molecular engineering towards the target synthetic function. This model offers an extrapolative and interpretable approach for reaction performance prediction, pointing out the importance of chemical knowledge-constrained reaction modelling for synthetic purpose.


.1 Details of tested descriptors
In order to understand the baseline performances of machine learning modelling in a range of prediction tasks, we tested the predictive performances using widely applied molecular descriptors. The tested molecular descriptors include One-Hot, RDKit descriptors, Morgan Fingerprint, and Atom-centered Symmetry Functions. The tested molecular descriptors, as well as the corresponding generation parameters and package, are shown in Supplementary Table 1. All the related scripts for molecular descriptor generation are available in our GitHub project (https://github.com/Shuwen-Li/SEMG-MIGNN).

Details of GCN model
For each molecule, the molecular graph was generated. Subsequently, the baseline molecular graphs were processed by the GCN layer using dgl. The processed molecular graphs were concatenated together and passed through the sum/mean/dense layers to predict the reaction performance. Detailed workflow of the GCN model is shown in Supplementary Figure 1.

Supplementary Figure 1. Workflow of the GCN model (Graph Convolutional Network).
For each molecule, the corresponding molecular graph was generated. Subsequently, the molecular graphs were processed by the GCN layer (Graph Convolutional Network). The processed molecular graphs were then concatenated together and passed through the sum/mean/dense layers to predict the reaction performance.

Details of hyperparameter optimization for the tested machine learning models
For the tested encoding methods and machine learning algorithms, we carefully performed the hyperparameter optimization by grid search method, in order to identify the optimal hyperparameter settings.
Supplementary  To ensure the reliability of our experimental data, we first repeated one transformation using an identical chiral phosphoric acid CPA-0 from Denmark's report 20 . This transformation was reported to have a 99% ee, which corresponds to a 3.13 kcal mol -1 free energy difference. Despite extensive efforts, our repeated experiment consistently gave an enantioselectivity of 95% ee, 2.17 kcal mol -1 (Supplementary Figure 2).
We believe this mitigation of enantioselectivity raised from the influence of the racemic background reaction, probably catalyzed by the trace amount of inseparable Lewis acid. During the experimental explorations, we noticed that the imine addition is very fast, and we tried our best to eliminate the influence of background reaction. We have tried various means of purification and more stringent reaction setups, such as new glassware for each transformation, but we still cannot completely avoid the reduction of enantioselectivity. In order to make reasonable predictions using the Denmark's statistics-trained machine learning model, we applied a scaling factor of Denmark's statistics as a reasonable compromise. The mechanistic reasoning of scaling factor is elaborated as follows: Without the background reaction, ΔΔ ‡ = − ln

S7
Applying second-order Taylor expansion: Because the substrates were synthesized by the same reaction and subsequently purified, the content of the Lewis acid should be comparable, and the corresponding background reaction rate is approximately constant ( _ ). In addition, since all the chiral catalysts are CPA, the total rates of the catalytic reactions ( _ ) should also be approximately comparable ( _ is the total reaction rate under CPA catalysis, which does not require the enantioselectivity to be the same). Therefore, even there is a discrepancy in the actual enantioselectivity, we believe the  In order to find the suitable level for geometry optimization to obtain the steric encodings accurately and efficiently, we compared the geometries optimized by the MMFF, GFN2-xTB, and DFT (B3LYP/def2-SVP) methods, and representative results are shown in Supplementary Figure 3. Using the B3LYP/def2-SVP structures as reference, the root-mean-square deviation (RMSD) of the MMFF and GFN2-xTB were computed.
It was shown that MMFF is not suitable for the geometry optimization of complex molecules, giving incorrect orientations for certain key substituents (such as the highlighted ones in Supplementary Figure 3) and yielding an unsatisfying level of RMSD. In comparison, GFN2-xTB significantly improved the accuracy of geometry optimization, achieving a level close to that of DFT optimization while still meeting our requirements for high-S9 throughput virtual screening. Therefore, we eventually chose GFN2-xTB level of theory for geometry optimization.
To ensure the reliability of the GFN2-xTB geometries in terms of modelling accuracy, we further prediction tasks, the two models have comparable predictive abilities. These comparisons further supported that the selected GFN2-xTB level of theory can provide the required accuracy for geometry optimization and enable the desired machine learning modelling.

Electronics-embedded Molecular Graph and Molecular Interaction Graph Neural Network) trained by the geometries optimized at the GFN2-xTB level (a) and the B3LYP/def2-SVP level (b). The yield dataset
is randomly split to 70% (training) and 30% (test). The enantioselectivity task is randomly split to 600 (training) and 475 (test) transformations.

Evaluation of the methods for electron density calculation
In order to find the suitable level for the calculation of electron density, we have evaluated the electron densities calculated by various methods. We ultimately selected the theoretical level of B3LYP/def2-SVP to obtain the electron densities and process the model trainings.
Based on the GFN2-xTB-optimized geometries, the accuracies of the computed electron densities were evaluated for thirty-five levels of theory including the variations of five functionals (LDA-VWN, B3LYP, M06-2X, ωB97X-D, PBE0) and seven basis sets (STO-3G, STO-6G, 3-21G, def2-SVP, 6-31G(d), 6-311+G**, def2-TZVPP). The evaluation process is elaborated in Supplementary Figure 5a. For a given molecule in the studied dataset, the electron densities of the same geometry were compared between two levels of theory: the reference level (B3LYP/def2-TZVPP) and the comparing level (the other thirty-four levels). The neighboring electron density of each atom was assessed to obtain a 7x7x7 tensor with the vdW diameter size.
This creates a Nx7x7x7 tensor for the entire molecule, which was flattened into an one-dimensional vector.
Subsequently, the Euclidean distances between the two vectors were calculated to provide the quantified evaluation of the change of electron densities.   Figure 6.    CPA-26 The yield dataset is randomly split to 70% (training) and 30% (test). The enantioselectivity task is randomly split to 600 (training) and 475 (test) transformations.

Results of the yield regression performances of Baseline MG-GCN
We performed ten trials of yield prediction using the Baseline MG-GCN model. The detailed prediction results are provided in Supplementary Table 6. Entry 8 is selected as a representative example in Figure 4 of main text.

Results of the yield regression performances of SEMG-GCN
We performed ten trials of yield prediction using the SEMG-GCN model. SEMG means Sterics-and Electronics-embedded Molecular Graph. The detailed prediction results are provided in Supplementary Table   7. Entry 3 is selected as a representative example in Figure 4 of main text.

Results of the yield regression performances of Baseline MG-MIGNN
We performed ten trials of yield prediction using the Baseline MG-MIGNN model. MIGNN means Molecular Interaction Graph Neural Network. The detailed prediction results are provided in Supplementary Table 8.
Entry 8 is selected as a representative example in Figure 4 of main text.

Results of the yield regression performances of SEMG-MIGNN (Sterics-and Electronicsembedded Molecular Graph and Molecular Interaction Graph Neural Network)
We performed ten trials of yield prediction using the SEMG-MIGNN model (Sterics-and Electronicsembedded Molecular Graph and Molecular Interaction Graph Neural Network). The detailed prediction results are provided in Supplementary Table 9. Entry 6 is selected as a representative example in Figure 4 of main text.

Results of the enantioselectivity regression performances of Baseline MG-GCN
We performed ten trials of enantioselectivity prediction using the Baseline MG-GCN model. The detailed prediction results are provided in Supplementary Table 10. Entry 2 is selected as a representative example in Figure 5 of main text.

Results of the enantioselectivity regression performances of SEMG-GCN
We performed ten trials of enantioselectivity prediction using the SEMG-GCN model. SEMG means Sterics-and Electronics-embedded Molecular Graph. The detailed prediction results are provided in Supplementary Table 11. Entry 1 is selected as a representative example in Figure 5 of main text.

Results of the enantioselectivity regression performances of Baseline MG-MIGNN
We performed ten trials of enantioselectivity prediction using the Baseline MG-MIGNN model. MIGNN means Molecular Interaction Graph Neural Network. The detailed prediction results are provided in Supplementary Table 12. Entry 6 is selected as a representative example in Figure 5 of main text.

Results of the enantioselectivity regression performances of SEMG-MIGNN (Sterics-and Electronics-embedded Molecular Graph and Molecular Interaction Graph Neural Network)
We performed ten trials of enantioselectivity prediction using the SEMG-MIGNN model (Sterics-and Electronics-embedded Molecular Graph and Molecular Interaction Graph Neural Network). The detailed prediction results are provided in Supplementary Table 13. Entry 3 was selected as a representative example in Figure 5 of main text. Regression, XGBoost, and Neural Network) to predict the yields of C-N cross coupling reaction and the enantioselectivity of asymmetric N,S-acetal formation. Supplementary Figure 14 shows the results of yield predictions, and Supplementary Figure 15 shows the results of enantioselectivity predictions.

Supplementary
Using the best algorithm for each type of descriptor, hyperparameter optimization was applied. The details of the hyperparameter optimization are provided in Supplementary Table 4, and the optimization results are summarized in Supplementary Table 14 and Supplementary Table 15. The optimized hyperparameters were subsequently used in the related machine learnings.
Supplementary Figure 14. RMSEs (Root Mean Square Errors, in %) of yield predictions using widely applied molecular descriptors. The yield dataset is randomly split to 70% (training) and 30% (test). The red frame represents the best model corresponding to the four baseline descriptors.
Supplementary Figure 15. RMSEs (Root Mean Square Errors, in kcal mol -1 ) of enantioselectivity predictions using widely applied molecular descriptors. The enantioselectivity dataset is randomly split to 600 (training) and 475 (test). The red frame represents the best model corresponding to the four baseline descriptors.

Evaluation of structural sensitivity of the SEMG-MIGNN model (Sterics-and Electronicsembedded Molecular Graph and Molecular Interaction Graph Neural Network)
We tested the impact of the initial structure on the modelling performance. Ten different random seeds were applied for the generation of the molecular structures using the EmbedMolecule module of RDKit.
Subsequently, we performed the geometry optimizations and electronic structure calculations through the same process. The changes in prediction performances are summarized in Supplementary Table 16 (yield   task) and Supplementary Table 17 (enantioselectivity task), which showed marginal influence from the selection of random seed. These additional results demonstrate that the model is not sensitive to the initial random seed for structural generation. The corresponding random seeds are also provided on our GitHub repository (https://github.com/Shuwen-Li/SEMG-MIGNN) for readers to reproduce.  Figure 16). We repeated the prediction ten times, and the shadow in the figure represents the range of error in these ten predictions. In both yield and enantioselectivity tasks, the SEMG-MIGNN model (Sterics-and Electronics-embedded Molecular Graph and Molecular Interaction Graph Neural Network) can achieve an acceptable performance with 20% of the training data, and its predictive ability approached convergence with 70% or more the training data.  The pair of CPAs that has the median value of Fingerprint similarity.

Details of the tested SOTA models
For the Yield-BERT model 24 , all the prediction tasks were trained using the default parameters of the original study 24 .
For the DRFP 21 encoding, we used XGBoost algorithm as in the original study 21  None. The best parameters for enantioselectivity tasks were n_estimators= 100 and max_depth= None.
In addition, we have re-divided the yield datasets from the perspective of scaffold splitting, and evaluated the SOTA models in a series of prediction tasks. Supplementary Figure 20 shows the details of scaffold splitting in the yield dataset. For aryl halides, the substituted arenes were selected in the training set, and the pyridines were included in the test set. For Buchwald ligands, we chose the two ligands with the additional methoxy substituent as the training set and the rest two ligands as the test set. For base, the guanidine-type organic bases are used for the training set, and phosphazene are included in the test set. For the oxazole additives, we selected the mono-substituted ones as the training set and the di-substituted ones as the test set. The above scaffold-based splittings have clear organic chemistry meanings and pose extrapolative challenges from the synthetic perspective.

Enantioselectivity prediction in asymmetric N,S-acetal formation
We also compared the SOTA models in 13 enantioselectivity prediction tasks. In addition to the 9 random data splitting tasks with different ratios of training data, we also divided the imines, thiols, and catalysts based on the molecular scaffold. Supplementary Figure 22 elaborates the details of these scaffold-based data splitting. The division of imines classified imine-5 with bicyclic naphthyl substituent as the test set, while only monocyclic aryl substituents were included in the training set. Thiols were classified to aliphatic thiols (test set) and aromatic thiols (training set). For the phosphoric acid catalysts, they were divided to the training set of BINAP CPAs and the test set of H8-BINAP CPAs. In addition to these scaffold-based splitting, we also examined the transformation-based splitting; the 9 transformations involving imine-1 and thiol-1 were divided to the test set, while the remaining 16 transformations were used as the training set. The above data splitting posed a series of extrapolative challenges for the machine learning models and examined the prediction performances under application scenarios.
Supplementary Table 21 summarized the performances of DRFP, MFF, and SEMG-MIGNN models (Sterics-and Electronics-embedded Molecular Graph and Molecular Interaction Graph Neural Network).
Yield-BERT was not considered because it was not developed for enantioselectivity prediction. 24